perm filename KWIC.LES[UP,DOC] blob
sn#084242 filedate 1977-03-09 generic text, type C, neo UTF8
KWIC -- a keyword in context program -- L. Earnest, December 1973
This program can be used to produce a concordance, index, word count,
or word list for any given text file. The simplest command that it
understands is:
*<source file name>
This causes the source file to be scanned for words, which are
compared with an internal dictionary of common words. Any that are
not in the dictionary are considered to be "keywords". The program
produces an output file, in this case called <source>.KWC, that
contains an alphabetized list of keywords, one per line, together
with the local context and a reference to the page and line on which
they occur. It also lists the number of occurrences of each
dictionary word. A typical output might begin as follows.
                          Concordance of SIGNUP[W,LES]
                       275 keywords, 961 dictionary words

                                     47 a
                                      5 about
Page Line                               ------
   5   22                     A roll of adhesive tape or electrical tape.
                                      6 after
Page Line                               ------
   1   30      August 16 at noon in the AI Conference Room.
                                      2 air(s)(ed)(ing)
                                      3 all
Page Line                               ------
   3   15             If you come to an ambiguous fork in the trail, preferably
                                      1 among
                                        ......
Numbers appearing just to the left of center are word counts for
dictionary words (with various suffixes), while the page and line
numbers point to the locations of keywords in the original document.
Line numbers are counted from the top of the page. SOS line numbers
(if any) are ignored, as are TV/E directory pages, though the page
numbering includes the directory. Words beginning with different
letters of the alphabet are placed on different output pages.
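
The split between dictionary words and keywords described above can be
sketched in a few lines of Python. This is only an illustration and not
the program's own code: the COMMON_WORDS set is a hypothetical stand-in
for the internal dictionary, the punctuation stripping is simplified,
and the merging of suffixes (as in "air(s)(ed)(ing)") is not modeled.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for KWIC's internal dictionary of common words.
COMMON_WORDS = {"a", "about", "after", "all", "among", "of", "or", "the"}


def classify(pages):
    """Tally dictionary words and collect keyword references.

    pages is a list of pages, each page a list of text lines.
    Returns (counts, refs): counts maps a dictionary word to its number
    of occurrences; refs maps a keyword to a list of
    (page, line, context) tuples, with lines counted from the top of
    each page.
    """
    counts = Counter()
    refs = defaultdict(list)
    for page_no, lines in enumerate(pages, start=1):
        for line_no, text in enumerate(lines, start=1):
            for raw in text.split():
                # Simplified tokenizing: trim common punctuation only.
                word = raw.strip(".,;:!?\"()").lower()
                if not word or not word[0].isalpha():
                    continue  # skip anything not starting with a letter
                if word in COMMON_WORDS:
                    counts[word] += 1
                else:
                    refs[word].append((page_no, line_no, text))
    return counts, refs
```

Sorting the keys of both results together would give the alphabetized
listing, with a count printed for each dictionary word and a page, line,
and context printed for each keyword occurrence.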
General Command Format
The more general command format is:
*[<output file>←]<source file>[/ONLY | /ALL][/INDEX | /COUNT | /LIST]
where bracketed elements are optional and alternative switches are
separated by "|". Both source and output files must be on the disk.
All switches may be abbreviated to one letter. The /ONLY switch
causes only keywords to be listed in the output file (i.e. omitting
counts of dictionary words). The /ALL switch causes the dictionary
to be ignored, so that ALL words are treated as keywords. (Beware: a
concordance produced with the /ALL switch is typically about 10
times the size of the original document.)
The /INDEX switch causes the context to be omitted and produces a
three-column listing of words and their original locations (page and
line) or number of occurrences (dictionary words). The /COUNT switch
reduces each keyword entry to a word count alone and produces a
four-column listing of these counts. The /LIST switch produces a
raw, seething word list (i.e. an alphabetized list of all words
used), one per line, with no header information, and all on one long
page.
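
As an illustration of the command format, the following Python sketch
splits a command into output file, source file, and switches, accepting
any unambiguous abbreviation of a switch name. It is not taken from the
program: the function name parse_command is invented here, and the
character separating the output file from the source file is assumed to
be a left arrow (it can be changed through the arrow parameter).

```python
SWITCHES = ["ONLY", "ALL", "INDEX", "COUNT", "LIST"]


def parse_command(command, arrow="←"):
    """Parse '[output<arrow>]source[/switch...]' into a dict.

    The arrow character is an assumption; pass another separator if the
    command syntax in use differs.
    """
    text = command.lstrip("*").strip()      # drop the "*" prompt, if any
    head, *given = text.split("/")
    # Everything before the arrow is the output file; it may be absent.
    output, _, source = head.rpartition(arrow)
    switches = set()
    for item in given:
        name = item.strip().upper()
        matches = [s for s in SWITCHES if name and s.startswith(name)]
        if len(matches) != 1:
            raise ValueError(f"unknown switch: /{item}")
        switches.add(matches[0])
    if {"ONLY", "ALL"} <= switches:
        raise ValueError("/ONLY and /ALL are alternatives")
    if len(switches & {"INDEX", "COUNT", "LIST"}) > 1:
        raise ValueError("choose one of /INDEX, /COUNT, /LIST")
    return {
        "output": output.strip() or None,   # None means: use the default
        "source": source.strip(),
        "switches": switches,
    }
```

Because the five switch names start with different letters, a single
letter is always enough to identify one, which matches the rule that all
switches may be abbreviated to one letter.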
Scanning Procedure
KWIC treats as a word any alphanumeric string beginning with a letter
and possibly containing "'", "-", or "/", but nothing else. Thus,
strings beginning with a digit are ignored. Words hyphenated over line
boundaries are reassembled.
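
The word rule above maps naturally onto a regular expression. The Python
sketch below is one possible reading of it, not the program's actual
scanner. It assumes that a word hyphenated over a line boundary is
rejoined by dropping the hyphen and the line break, which would also
merge a genuine compound that happens to break at its hyphen.

```python
import re

# A word starts with a letter and continues with letters, digits,
# "'", "-", or "/". The lookbehind stops a match from starting in the
# middle of a string such as "16th", so that string is skipped whole.
WORD = re.compile(r"(?<![A-Za-z0-9'/-])[A-Za-z][A-Za-z0-9'/-]*")

# A hyphen at the end of a line, directly after a letter and followed
# (after optional indentation) by a letter on the next line.
LINE_HYPHEN = re.compile(r"(?<=[A-Za-z])-\r?\n[ \t]*(?=[A-Za-z])")


def scan_words(text):
    """Return the words found in text, in order of appearance."""
    joined = LINE_HYPHEN.sub("", text)  # reassemble split words
    return WORD.findall(joined)
```

For example, in the text "key-" followed on the next line by "word
isn't read/write on 16th", this sketch yields keyword, isn't, read/write,
and on, while 16th is ignored.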
In order to provide as much context as possible for each keyword, the
text is "dejustified" within each paragraph, so that redundant spaces
between words are removed and successive lines are concatenated, with
a <space> replacing the <CRLF>. A new paragraph is assumed to begin
whenever there is a blank line, a <TAB> in column 1, or a <form
feed>.
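
The dejustification step can be sketched as follows. This is a minimal
Python illustration of the rules just stated, not the program's own
code; the function name dejustify is invented here, and the sketch does
not deal with words hyphenated over line boundaries.

```python
def dejustify(text):
    """Split text into paragraphs and undo justification in each.

    A paragraph ends at a blank line, at a line with a TAB in column 1,
    or at a form feed. Within a paragraph, runs of spaces are reduced
    to one space and lines are joined with a single space.
    """
    paragraphs = []
    current = []

    def flush():
        # Close the paragraph being collected, if there is one.
        if current:
            paragraphs.append(" ".join(current))
            current.clear()

    for chunk in text.split("\f"):              # form feed ends a paragraph
        for line in chunk.replace("\r\n", "\n").split("\n"):
            if not line.strip():                # blank line ends a paragraph
                flush()
                continue
            if line.startswith("\t"):           # TAB in column 1 starts one
                flush()
            current.append(" ".join(line.split()))  # squeeze extra spaces
        flush()
    return paragraphs
```

Each returned string is one paragraph on a single long line, so the
context around a keyword can be cut from it without regard to where the
original line breaks fell.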